source("utils.R")
theme_set(theme_minimal())Irregular Lattices
Dependencies
Until now, we have considered the cells to be represented in a point pattern. However, as cells have a shape and area, this might be an oversimplification in some cases. Alternatively, we can rely on the segmentation of individual cells that are available for various datasets. The outline of each cell is represented by a polygon and the collection of all cells can be seen as an irregular lattice. Unlike a regular lattice (e.g., spot-based spatial transcriptomics data), the sample areas in an irregular lattice can have different sizes and are not necessarily regularly distributed over the sample space.
For this representation of the cells we will rely on the SpatialFeatureExperiment package. For preprocessing of the dataset we refer the reader to the vignette of the voyager package (Moses et al. 2023). The voyager package also provides wrapper functions around the package spdep (Pebesma and Bivand 2023) that work directly on the SpatialFeatureExperiment object.
(sfe <- HeNSCLCData())class: SpatialFeatureExperiment
dim: 980 100290
metadata(0):
assays(1): counts
rownames(980): AATK ABL1 ... NegPrb22 NegPrb23
rowData names(3): means vars cv2
colnames(100290): 1_1 1_2 ... 30_4759 30_4760
colData names(17): Area AspectRatio ... nCounts nGenes
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : CenterX_global_px CenterY_global_px
imgData names(1): sample_id
unit: full_res_image_pixels
Geometries:
colGeometries: centroids (POINT), cellSeg (POLYGON)
Graphs:
sample01:
# Empty cells
colData(sfe)$is_empty <- colData(sfe)$nCounts < 1
# Select, sum negative control probes
(neg_inds <- str_detect(rownames(sfe), "^NegPrb")) %>% sum[1] 20
colData(sfe)$prop_neg <- colSums(counts(sfe)[neg_inds,])/colData(sfe)$nCounts
# Remove low quality cells
sfe <- sfe[,!sfe$is_empty & sfe$prop_neg < 0.1]
# Re-calculate stats
rowData(sfe)$means <- rowMeans(counts(sfe))
rowData(sfe)$vars <- rowVars(counts(sfe))
rowData(sfe)$is_neg <- neg_inds
# log Counts
sfe <- logNormCounts(sfe)In this vignette, we will show the metrics related to two marker genes: KRT17 (basal cells) and TAGLN (smooth muscle cells).
plotSpatialFeature(sfe, c("KRT17"),
colGeometryName = "centroids",
ncol = 2, scattermore = TRUE) +
theme_void()plotSpatialFeature(sfe, c("TAGLN"),
colGeometryName = "centroids",
ncol = 2, scattermore = TRUE) +
theme_void()Irregular lattice and neighbourhood matrix
In general, a spatial autocorrelation measure has the following elements: a function, \(f(x_i, x_j)\), of the values of interest at locations \(i\) and \(j\) and a neighbourhood or weight matrix, \(w_{ij}\). The function \(f\) relates the measured values \(x_i\) and \(x_j\) and the neighbourhood matrix \(w_{ij}\) builds in the spatial relationship between location \(i\) and \(j\). If \(i\) and \(j\) are not neighbours, i.e. we assume they do not have any spatial association, the corresponding element of the weights matrix is 0 (i.e., \(w_{ij} = 0\)). In the following we will see that the function \(f\) varies between the different spatial autocorrelation measures (Zuur, Ieno, and Smith 2007; Pebesma and Bivand 2023).
Neighbourhood matrix
One of the challenges when working with (irregular) lattice data is the construction of a neighbourhood graph (Pebesma and Bivand 2023). The main question is, what to consider as neighbours, as this will affect downstream analyses. Various methods exist to create neighbours, such as contiguitiy based neighbours, graph-based neighbours, distance based neighbours or higher order neighbours (Getis 2009; Zuur, Ieno, and Smith 2007; Pebesma and Bivand 2023). The documentation of the package spdep gives an overview of the different methods.
Segmentation of individual cells is challenging (Wang 2019), therefore construction of contiguity-based neighbours based on individual cell segmentation assumes perfect segmentation results. Furthermore it would neglect the influence of more distant, not directly adjacent neighbours, which based on the feature of interest might not be the correct assumption.
In the following, we will use a k-nearest neighbours approach from voyager that relies on the spdep package. The approach and in particular the number \(k\) is somewhat arbritary.
colGraph(sfe, "knn5") <- findSpatialNeighbors(sfe, method = "knearneigh",
dist_type = "idw", k = 5,
style = "W")Global Methods
Global methods give us an overview over the entire field of view and summarize the spatial distribution of the cells (for a given measurement such as gene expression) in a single value. The most common global measures are Moran’s I and Geary’s C coefficients; these coefficients are a function of the weight matrix and the variables of interest.
In general, a global spatial autocorrelation measure has the form of a double sum over all locations \[\sum_i \sum_j f(x_i,x_j) w_{ij} .\]
Global Moran’s I coefficient
The global Moran’s I (Moran 1950) coefficient is a measure of spatial autocorrelation, defined as:
\[I = \frac{n}{\sum_i\sum_j w_{ij}} \frac{\sum_i\sum_j w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}, \]
where \(x_i\) and \(x_j\) represent the values of the variable of interest at locations \(i\) and \(j\), \(\hat{x}\) is the mean of all \(x\) and \(w_{ij}\) is the spatial weight between the locations of \(i\) and \(j\). The expected value is close to \(0\) for large \(n\) (\(\mathbb{E}(I) = -1/(n-1)\)), whereas a value higher than indicates spatial auto-correlation. Negative values indicate negative auto-correlation.
Implementation using voyager
calculateMoransI(sfe,
features = c("KRT17", "TAGLN"),
colGraphName = "knn5",
exprs_values = "logcounts"
)DataFrame with 2 rows and 2 columns
moran K
<numeric> <numeric>
KRT17 0.643966 5.09219
TAGLN 0.262370 6.55766
We can also use the moran.mc function to calculate the Moran’s I coefficient. This function uses a Monte Carlo simulation to calculate the P-value.
sfe <- runUnivariate(sfe,
features = c("KRT17", "TAGLN"),
colGraphName = "knn5",
exprs_values = "logcounts",
type = "moran.mc",
nsim = 200)
res <- rowData(sfe)[c("KRT17", "TAGLN"),]
#value of the metric
res$moran.mc_statistic_sample01[1] 0.6439663 0.2623699
#P-value
res$moran.mc_p.value_sample01[1] 0.004975124 0.004975124
We can see both genes have a positive Moran’s I coefficient and a highly significant P-value. The expected value is \(\mathbb{E}(I) = -1/(n-1)\) which is for large \(N\) close to 0. Positive and significant values indicate that areas with similar values are clustered. It is important to note that this could be both at the high or low end of the values of interest. Negative values indicate clustering of alternating values, i.e., gives a measure of spatial heterogeneity. Moreover, one should note that the result is dependent on the neighbourhood matrix. Different neighbourhood matrices will give different results. To compare Moran’s I coefficients between different values, we need to use the same neighbourhood matrix.
Global’s Geary’s C coefficient
Geary’s \(C\) (Geary 1954) is a different measure of global autocorrelation and is very closely related to Moran’s \(I\). However, it focuses on spatial dissimilarity. Geary’s \(C\) is defined by
\[ C = \frac{(n-1) \sum_i \sum_j w_{ij}(x_i-x_j)^2}{2\sum_i \sum_j w_{ij}\sum_i(x_i-\bar{x})^2} \] where \(x_i\) and \(x_j\) represent the values of the variable of interest at locations \(i\) and \(j\), \(\hat{x}\) is the mean of all \(x\), \(w_{ij}\) is the spatial weight between the locations of \(i\) and \(j\) and \(n\) the total numer of locations. The interpretation is opposite to Moran’s \(I\): a value smaller than \(1\) indicates positive auto-correlation whereas a value greater than \(1\) represents negative auto-correlation.
Implementation using voyager
sfe <- runUnivariate(sfe,
features = c("KRT17", "TAGLN"),
colGraphName = "knn5", nsim = 200,
type = "geary.mc")
res <- rowData(sfe)[c("KRT17", "TAGLN"),]
#value of the metric
res$geary.mc_statistic_sample01[1] 0.3563535 0.7259509
#P-value
res$geary.mc_p.value_sample01[1] 0.004975124 0.004975124
Again, the value of Geary’s \(C\) indicates that the genes are spatially auto-correlated.
Global Getis-Ord \(G\) statistic
The global \(G\) (Getis and Ord 1992) statistic is a generalisation of the local version (see below) and summarises the contributions of all pairs of values \((x_i, x_j)\) in the dataset. Formally that is
\[ G(d) = \frac{\sum_{i = 1}^n \sum_{j=1}^n w_{ij}(d)x_ix_j}{\sum_{i = 1}^n \sum_{j=1}^n x_i x_j}, \text{s.t } j \neq i \]
The global \(G(d)\) statistic is very similar to global Moran’s \(I\). The global \(G(d)\) statistic is based on the sum of the products of the datapoints whereas global Moran’s \(I\) is based on the sum of the covariances. Since these two approaches capture different aspects of a structure, their values will differ as well. A good approach would be to not use one statistic in isolation but rather consider both.
It is recommended to use binary weights for this calculation. We will use the spdep package directly to calculate the global \(G\) statistic.
Implementation using spdep
# Get the weight matrix from sfe object
weights_neighbourhoods_binary <- colGraph(sfe)
# Change it to binary weights
weights_neighbourhoods_binary$style <- "B"
# Calculate the global G statistic
spdep::globalG.test(x = counts(sfe)["KRT17",],
listw = weights_neighbourhoods_binary)
Getis-Ord global G statistic
data: counts(sfe)["KRT17", ]
weights: weights_neighbourhoods_binary
standard deviate = 280.88, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Global G statistic Expectation Variance
5.408783e-05 9.990609e-06 2.464837e-14
Local measures
Unlike global measures that give an overview over the entire field of view, local measures report information about the statistic at each location (cell). There exist local analogs of Moran’s I and Geary’s C for which the global statistic can be represented as a weighted sum of the local statistics. As above, the local coefficients are based on both the spatial weights matrix and the values of the measurement of interest.
Local Moran’s I coefficient
The local Moran’s I coefficient is a measure of spatial autocorrelation on each location of interest. It is defined as:
\[I_i = \frac{x_i - \bar{x}}{\sum_{k=1}^n(x_k-\bar{x})^2/(n-1)} \sum_{j=1}^n w_{ij}(x_j - \bar{x})\] where the index \(i\) refers to the location for which the measure is calculated. The interpretation is analogous to the global Moran’s I where a value of \(I_i\) higher than \(\mathbb{E}(I) = -1/(n-1)\) indicates spatial auto-correlation; smaller values indicate negative auto-correlation. It is important to note that, as for the global counterpart, the value of local Moran’s I could be a result from both the high or low end of the values. Since we measure and test a large number of locations simultaneously, we need to correct for multiple testing (e.g., using the Benjamini-Hochberg procedure).
Implementation using voyager
sfe <- runUnivariate(sfe,
features = c("KRT17", "TAGLN"),
type = "localmoran")
plotLocalResult(sfe, "localmoran",
features = c("KRT17", "TAGLN"), ncol = 2,
colGeometryName = "centroids")Local Geary’s C coefficient
Similar to local Moran’s I, there is a local Geary’s C (Anselin 1995) coefficient. It is defined as
\[C_i = \sum_{j=1}^n w_{ij}(x_i-x_j)^2\]
The interpretation is analogous to the global Geary’s C (value less than \(1\) indicates positive auto-correlation, a value greater than \(1\) highlights negative auto-correlation).
In this example, we will not plot the local Geary’s C coefficient for gene expression but for features that are associated with an individual cell, e.g., the number of counts or the number of genes expressed. For this, the colDataUnivariate function is used to calculate the local Geary’s C coefficient for such features.
Implementation using voyager
sfe <- runUnivariate(sfe,
features = c("KRT17", "TAGLN"),
type = "localC")
plotLocalResult(sfe, "localC",
features = c("KRT17", "TAGLN"), ncol = 2,
colGeometryName = "centroids")Local Getis-Ord statistic
The local Getis-Ord \(G_i\) (J. K. Ord and Getis 1995; Getis and Ord 1992) statistic quantifies the weighted concentration of points within a radius \(d\) and in a local region \(i\), according to:
\[ G_i(d) = \frac{\sum_{j \neq i } w_{ij}(d)x_j}{\sum_{j \neq i} x_j}. \]
There is a variant of this statistic, \(G_i^*(d)\), which is the same as \(G_i(d)\) except that the contribution when \(j=i\) is included in the term.
Implementation using voyager
sfe <- runUnivariate(sfe,
features = c("KRT17", "TAGLN"),
#include_self = TRUE, # this would specify G_i^*(d)
type = "localG")
plotLocalResult(sfe, "localG",
features = c("KRT17", "TAGLN"), ncol = 2,
colGeometryName = "centroids")The results above gives an estimate of the local Getis-Ord statistic for each cell, but no siginificance value. This can be done by using a permutation approach. As this computationally very intensive we will show it on a subregion of the tissue.
bbox_use <- st_as_sfc(st_bbox(c(xmin = 3200, xmax = 16800, ymin = 155200, ymax = 166200)))
sfe_sub <- sfe[,st_intersects(colGeometries(sfe)$centroids, bbox_use, sparse = FALSE)]
# run localG with permutation test
sfe_sub <- runUnivariate(sfe_sub,
features = "KRT17",
#include_self = TRUE, # this would specify G_i^*(d)
type = "localG_perm")
p1 <- plotLocalResult(sfe_sub, "localG_perm",
features = "KRT17",
colGeometryName = "centroids")
p2 <- plotLocalResult(sfe_sub, "localG_perm",
features = "KRT17",
attribute = "-log10p_adj Sim",
colGeometryName = "centroids")
p1 | p2Positive values indicate clustering of high values, i.e., hot spots, and negative values indicate clustering of low values, i.e., cold spots. The method does not detect outlier values because, unlike in local Moran’s I, there is no cross-product between \(i\) and \(j\). But unlike local Moran’s I, we know the type of interaction exists (high-high or low-low) between \(i\) and \(j\).
Local spatial heterosceadiscity (LOSH)
The local spatial heteroscedasticity (LOSH) is a measure of spatial autocorrelation that is based on the variance of the local neighbourhood. Unlike the other measures, this method does not assume homoscedastic variance over the whole tissue region. LOSH is defined as:
\[H_i(d) = \frac{\sum_j w_{ij}(d)|e_j(d)|^a}{\sum_j w_{ij}(d)},\]
where \(e_j(d) = x_j - \bar{x}_i(d), j \in N(i,d)\) are the local residuals that are subtracted from the local mean. The power \(a\) modulates the interpretation of the residuals (\(a=1\): residuals are interpreted as absolute deviations from the local mean; \(a=2\): residuals are interpreted as deviations from the local variance).
The LOSH should be interpreted in combination with the local Getis-Ord \(G_i^*\) statistic. The \(G_i^*\) quantifies the local mean of the variable of interest, while \(H_i\) quantifies the local variance. This table provided by Ord and Getis (J. Keith Ord and Getis 2012) summarizes the interpretation of the combination of \(G_i^*\) and \(H_i\).
| high \(H_i\) | low \(H_i\) | |
|---|---|---|
| large \(\|G_i^*\|\) | A hot spot with heterogeneous local conditions | A hot spot with similar surrounding areas; the map would indicate whether the affected region is larger than the single “cell” |
| small $ |G_i^*| $ | Heterogeneous local conditions but at a low average level (an unlikely event) | Homogeneous local conditions and a low average level |
# run localG with permutation test
sfe_sub <- runUnivariate(sfe_sub,
features = "KRT17",
type = "LOSH")
plotLocalResult(sfe_sub, "LOSH",
features = "KRT17",
colGeometryName = "centroids")A note of caution
The local methods presented above should always be interpreted with care, since we face the problem of multiple testing when calculating them for each cell. Moreover, the presented methods should mainly serve as exploratory measures to identify interesting regions in the data. Multiple processes can lead to the same pattern, thus from identifying the pattern we cannot infer the underlying process. Indication of clustering does not explain why this occurs. On the one hand, clustering can be the result of spatial interaction between the variables of interest. We have an accumulation of a gene of interest in one region of the tissue. On the other hand clustering can be the result spatial heterogeneity, when local similarity is created by structural heterogeneity in the tissue, e.g., that cells with uniform expression of a gene of interest are grouped together which then creates the apparent clustering of the gene expression measurement.
Appendix
Session info
sessionInfo()R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] stringr_1.5.1 dixon_0.0-9
[3] splancs_2.01-44 spdep_1.3-1
[5] spData_2.3.0 tmap_3.3-4
[7] scater_1.29.4 scran_1.28.2
[9] scuttle_1.10.3 SFEData_1.2.0
[11] SpatialFeatureExperiment_1.2.3 Voyager_1.2.7
[13] rgeoda_0.0.10-4 digest_0.6.34
[15] ncf_1.3-2 sf_1.0-15
[17] reshape2_1.4.4 patchwork_1.2.0
[19] STexampleData_1.8.0 ExperimentHub_2.8.1
[21] AnnotationHub_3.8.0 BiocFileCache_2.11.1
[23] dbplyr_2.4.0 RANN_2.6.1
[25] seg_0.5-7 sp_2.1-2
[27] rlang_1.1.3 ggplot2_3.4.4
[29] dplyr_1.1.4 mixR_0.2.0
[31] spatstat_3.0-7 spatstat.linnet_3.1-3
[33] spatstat.model_3.2-8 rpart_4.1.19
[35] spatstat.explore_3.2-5 nlme_3.1-162
[37] spatstat.random_3.2-2 spatstat.geom_3.2-7
[39] spatstat.data_3.0-4 SpatialExperiment_1.10.0
[41] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
[43] Biobase_2.60.0 GenomicRanges_1.52.1
[45] GenomeInfoDb_1.36.4 IRanges_2.34.1
[47] S4Vectors_0.38.2 BiocGenerics_0.46.0
[49] MatrixGenerics_1.12.3 matrixStats_1.2.0
loaded via a namespace (and not attached):
[1] spatstat.sparse_3.0-3 bitops_1.0-7
[3] httr_1.4.7 RColorBrewer_1.1-3
[5] tools_4.3.1 utf8_1.2.4
[7] R6_2.5.1 HDF5Array_1.28.1
[9] mgcv_1.8-42 rhdf5filters_1.12.1
[11] withr_3.0.0 gridExtra_2.3
[13] leaflet_2.2.1 leafem_0.2.3
[15] cli_3.6.2 labeling_0.4.3
[17] proxy_0.4-27 R.utils_2.12.3
[19] dichromat_2.0-0.1 scico_1.5.0
[21] limma_3.56.2 rstudioapi_0.15.0
[23] RSQLite_2.3.4 generics_0.1.3
[25] crosstalk_1.2.1 Matrix_1.6-5
[27] ggbeeswarm_0.7.2 fansi_1.0.6
[29] abind_1.4-5 R.methodsS3_1.8.2
[31] terra_1.7-65 lifecycle_1.0.4
[33] yaml_2.3.8 edgeR_3.42.4
[35] rhdf5_2.44.0 tmaptools_3.1-1
[37] grid_4.3.1 blob_1.2.4
[39] promises_1.2.1 dqrng_0.3.2
[41] crayon_1.5.2 lattice_0.21-8
[43] beachmat_2.16.0 KEGGREST_1.40.1
[45] magick_2.8.2 pillar_1.9.0
[47] knitr_1.45 metapod_1.7.0
[49] rjson_0.2.21 boot_1.3-28.1
[51] codetools_0.2-19 wk_0.9.1
[53] glue_1.7.0 vctrs_0.6.5
[55] png_0.1-8 gtable_0.3.4
[57] cachem_1.0.8 xfun_0.41
[59] S4Arrays_1.0.6 mime_0.12
[61] DropletUtils_1.20.0 units_0.8-5
[63] statmod_1.5.0 bluster_1.10.0
[65] interactiveDisplayBase_1.38.0 ellipsis_0.3.2
[67] bit64_4.0.5 filelock_1.0.3
[69] irlba_2.3.5.1 vipor_0.4.7
[71] KernSmooth_2.23-21 colorspace_2.1-0
[73] DBI_1.2.1 raster_3.6-26
[75] tidyselect_1.2.0 bit_4.0.5
[77] compiler_4.3.1 curl_5.2.0
[79] BiocNeighbors_1.18.0 DelayedArray_0.26.7
[81] scales_1.3.0 classInt_0.4-10
[83] rappdirs_0.3.3 goftest_1.2-3
[85] spatstat.utils_3.0-4 rmarkdown_2.25
[87] XVector_0.40.0 htmltools_0.5.7
[89] pkgconfig_2.0.3 base64enc_0.1-3
[91] sparseMatrixStats_1.12.2 fastmap_1.1.1
[93] htmlwidgets_1.6.4 shiny_1.8.0
[95] DelayedMatrixStats_1.22.6 farver_2.1.1
[97] jsonlite_1.8.8 BiocParallel_1.34.2
[99] R.oo_1.25.0 BiocSingular_1.16.0
[101] RCurl_1.98-1.14 magrittr_2.0.3
[103] GenomeInfoDbData_1.2.10 s2_1.1.6
[105] Rhdf5lib_1.22.1 munsell_0.5.0
[107] Rcpp_1.0.12 ggnewscale_0.4.9
[109] viridis_0.6.4 stringi_1.8.3
[111] leafsync_0.1.0 zlibbioc_1.46.0
[113] plyr_1.8.9 parallel_4.3.1
[115] ggrepel_0.9.5 deldir_2.0-2
[117] Biostrings_2.68.1 stars_0.6-4
[119] splines_4.3.1 tensor_1.5
[121] locfit_1.5-9.8 igraph_1.6.0
[123] ScaledMatrix_1.8.1 BiocVersion_3.17.1
[125] XML_3.99-0.16 evaluate_0.23
[127] BiocManager_1.30.22.3 httpuv_1.6.13
[129] purrr_1.0.2 polyclip_1.10-6
[131] scattermore_1.2 rsvd_1.0.5
[133] lwgeom_0.2-13 xtable_1.8-4
[135] e1071_1.7-14 RSpectra_0.16-1
[137] later_1.3.2 viridisLite_0.4.2
[139] class_7.3-22 tibble_3.2.1
[141] memoise_2.0.1 beeswarm_0.4.0
[143] AnnotationDbi_1.62.2 cluster_2.1.4